Abstract

Background: Answering questions such as "Which genes are related to breast cancer?" usually requires retrieving 
relevant publications through the PubMed search engine, reading these publications, and creating gene lists. This 
process is not only time-consuming, but also prone to errors. 

Results: We report GLAD4U (Gene List Automatically Derived For You), a new, free web-based gene retrieval and 
prioritization tool. GLAD4U takes advantage of existing resources of the NCBI to ensure computational efficiency. 
The quality of gene lists created by GLAD4U for three Gene Ontology (GO) terms and three disease terms was 
assessed using corresponding "gold standard" lists curated in public databases. For all queries, GLAD4U gene lists 
showed very high recall but low precision, leading to low F-measures. In comparison, EBIMed's recall was
consistently lower than that of GLAD4U, but its precision was higher. To present the most relevant genes at the top of a
list, we studied two prioritization methods based on publication count and the hypergeometric test, and 
compared the ranked lists and those generated by EBIMed to the gold standards. Both GLAD4U methods 
outperformed EBIMed for all queries based on a variety of quality metrics. Moreover, the hypergeometric method 
allowed for a better performance by thresholding genes with low scores. In addition, manual examination suggests 
that many false-positives could be explained by the incompleteness of the gold standards. The GLAD4U user 
interface accepts any valid queries for PubMed, and its output page displays the ranked gene list and information 
associated with each gene, chronologically-ordered supporting publications, along with a summary of the run and 
links for file export and functional enrichment and protein interaction network analysis. 

Conclusions: GLAD4U has a high overall recall. Although precision is generally low, the prioritization methods 
successfully rank truly relevant genes at the top of the lists to facilitate efficient browsing. GLAD4U is simple to use, 
and its interface can be found at: http://bioinfo.vanderbilt.edu/glad4u. 




Genomics 



Background 

The physical development and phenotype of organisms 
can be thought of as a product of genes interacting with 
each other and with the environment. Therefore, it is 
common for a scientist to ask questions like "Which 
genes are related to breast cancer?", "Which genes are 
involved in embryonic development?", and "Which genes 
are functionally related to TP53?"

The current answers to these questions are primarily
contained in the articles indexed in the MEDLINE data- 
base. Traditionally, answering these questions requires 
individuals to retrieve relevant publications through the 
PubMed search engine and then to create gene lists by 
manually extracting gene-centered information from 
retrieved literature. This process is not only time-consum- 
ing, but also prone to errors. First, it is difficult to ascertain 
that all relevant literature is processed. Second, it is unli- 
kely that all relationships in a publication will be detected. 
Third, individual researchers tend to extrapolate based on 
domain knowledge.

Over the past decade, bioinformatics approaches have
been developed to address this issue. One of the most suc- 
cessful projects in this area is the Gene Ontology (GO) 
project [1]. GO produces a structured, precisely defined, 
and controlled vocabulary (i.e., GO terms) for describing 
the roles of genes and gene products in different species. 
Genes are associated with GO terms through manual 
curation as well as computational inference. A researcher 
can now go to the GO website [2] to get a list of genes 
related to a GO term of interest. However, as the GO 
vocabulary only describes gene products in terms of their 
associated biological processes, cellular components and 
molecular functions, users are restricted to questions that
can be expressed in this vocabulary. Moreover, processes, functions
or components that are unique to diseases, such as onco- 
genesis, are not included in GO because causing cancer is 
not the normal function of any gene. 

A useful resource specifically designed for disease stu- 
dies is the Online Mendelian Inheritance in Man (OMIM 
[3]) project. OMIM is a comprehensive, authoritative, 
and timely compendium of human genes and genetic 
phenotypes. It contains information on all known Men- 
delian disorders. However, information on complex dis- 
eases such as cancer and diabetes is lacking in OMIM. 

In addition to manual curation, text mining tools have 
been developed to assist gene list creation [4]. As an 
example, EBIMed [5,6] combines text mining with co- 
occurrence-based analysis to generate a prioritized list of 
genes for a user-provided query. Specifically, EBIMed 
collects MEDLINE records and available full text docu- 
ments for a user-provided query, identifies protein 
names, drugs, species, or GO terms in the documents, 
and prioritizes genes/proteins based on the number of 
co-occurrences of the different pairs (protein/protein, 
protein/drug, protein/species, protein/GO term) in the 
sentences of the documents in which they appear. 
EBIMed and similar tools, such as FACTA [7] and SciMi- 
ner [8], provide more flexible ways to create gene lists 
that are not limited to certain aspects of biology. Nevertheless,
they usually require heavy computation, and the
relevance of the resulting gene lists to the input queries
has not been systematically evaluated.

Here, we report GLAD4U (Gene List Automatically 
Derived For You), a new web-based gene retrieval and 
prioritization tool. GLAD4U takes advantage of existing 
resources at the National Center for Biotechnology Infor- 
mation (NCBI) to ensure computational efficiency. It 
provides a simple user interface that facilitates intuitive 
usage and interpretation of results. The quality of gene 
lists created by GLAD4U is assessed using corresponding 
"gold standard" lists curated in GO, GAD (Genetic Asso- 
ciation Database [9]), and OMIM. The performance of 
GLAD4U is also compared with EBIMed. 



Results 

Overall quality of the retrieved gene lists 

GLAD4U relies on the NCBI eSearch API to find publi- 
cations related to a user query and on the gene-to-publi- 
cation link table to identify genes from the retrieved 
publications. We used three GO biological process terms 
(apoptosis, cell adhesion and DNA repair) and three dis- 
ease terms (hypertension, obesity and schizophrenia) as 
queries to evaluate the overall quality of the retrieved 
gene lists. For each query, using a corresponding gene list 
curated by GO or GAD/OMIM as a gold standard, we 
calculated the precision, recall and F-measure of the 
retrieved gene list. As shown in Table 1, gene lists 
retrieved for all queries showed very high recall (0.90 
±0.03 for GO terms and 0.96±0.05 for disease terms). In 
contrast to the high recall, the precision was generally 
low (0.16±0.04 for GO terms and 0.06±0.02 for disease 
terms), leading to low F-measures (0.27±0.05 for GO 
terms and 0.12±0.03 for disease terms). EBIMed's recall
was consistently lower than that of GLAD4U (0.47±0.15 for GO
terms and 0.44±0.11 for disease terms). However, its precision
was higher than that of GLAD4U (0.20±0.05 for GO terms
and 0.16±0.04 for disease terms), resulting in better
F-measures (0.27±0.03 for GO terms and 0.23±0.04 for
disease terms).
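
The three metrics reported in Table 1 follow directly from the gene counts. As a minimal sketch (the function name `set_metrics` is ours and is not part of GLAD4U, and the F-measure is assumed to be the usual harmonic mean of precision and recall), the GLAD4U values for the apoptosis query can be reproduced from the 958 overlapping genes, the 6037 retrieved genes and the 1039 gold standard genes:

```python
def set_metrics(n_overlap, n_retrieved, n_gold):
    """Return (precision, recall, F-measure) for a retrieved gene list.

    n_overlap   -- genes present in both the retrieved list and the gold standard
    n_retrieved -- genes in the retrieved list
    n_gold      -- genes in the gold standard
    """
    precision = n_overlap / n_retrieved if n_retrieved else 0.0
    recall = n_overlap / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Assumption: F-measure is the harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Apoptosis query, GLAD4U counts taken from Table 1
p, r, f = set_metrics(n_overlap=958, n_retrieved=6037, n_gold=1039)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # 0.1587 0.9220 0.2708
```

The same call with the EBIMed counts (387 overlapping genes out of 1469 retrieved) gives the EBIMed column of the table.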

The low precision of GLAD4U may be partially attrib- 
uted to the incompleteness of the annotation in GO and 
GAD/OMIM. However, it is likely that the original gene 
lists include many irrelevant genes. In this case, a priori- 
tization step that ranks truly relevant genes at the top 
of a list would certainly facilitate efficient browsing. 

Performance of the prioritization methods 

We studied the performance of two methods to prioritize 
the gene lists. The first, "GLAD4U Counts", is based solely 
on the number of supporting publications as commonly 
implemented in other software [10,11]. The second,
"GLAD4U Hypergeometric", is proposed in this study
and is based on the hypergeometric test (see the Methods
section for details). We used the above-mentioned
three GO terms and three disease terms as queries to evaluate
the performance of our prioritization methods. We
also included the prioritized gene lists returned by 
EBIMed for comparison. 

Figure 1 depicts the precision/recall curves from this 
comparative evaluation. For all queries, based on manual 
inspection of the curves, both GLAD4U Counts and 
GLAD4U Hypergeometric outperformed EBIMed, espe- 
cially at the high precision range. Between the two 
GLAD4U methods, the Hypergeometric method per- 
formed better than the Counts method for GO term 
queries, while their performances were comparable for 
disease term queries. The superior overall performance
of the two GLAD4U methods over EBIMed was further
evaluated by computing the average precision (AP), a quantitative
measure of quality across all recall levels (Table 2). In this analysis,
GLAD4U Counts and Hypergeometric methods scored
better than EBIMed (0.48±0.10, 0.52±0.12 and 0.21±0.09,
respectively), with GLAD4U Hypergeometric performing
the best (Table 2).

Table 1 Overall quality of the retrieved gene lists

Query          GO/MIM gene count  GLAD4U gene count     EBIMed gene count     Metric     GLAD4U  EBIMed
Apoptosis      1039               6037 (958) [195715]   1469 (387) [10000]    Precision  0.1587  0.2634
                                                                              Recall     0.9220  0.3725
                                                                              F-measure  0.2708  0.3086
Cell adhesion  785                4195 (691) [125144]   1725 (305) [10000]    Precision  0.1647  0.1769
                                                                              Recall     0.8802  0.3885
                                                                              F-measure  0.2775  0.2431
DNA repair     282                2476 (263) [60952]    1100 (180) [10000]    Precision  0.1062  0.1636
                                                                              Recall     0.9326  0.6383
                                                                              F-measure  0.1907  0.2605
Hypertension   87                 2046 (77) [323818]    135 (27) [10000]      Precision  0.0376  0.2000
                                                                              Recall     0.8851  0.3103
                                                                              F-measure  0.0721  0.2432
Obesity        111                1778 (110) [141615]   350 (59) [10000]      Precision  0.0619  0.1686
                                                                              Recall     0.9910  0.5315
                                                                              F-measure  0.1165  0.2560
Schizophrenia  94                 1725 (90) [91194]     382 (44) [10000]      Precision  0.0522  0.1152
                                                                              Recall     0.9574  0.4681
                                                                              F-measure  0.0990  0.1849

Numbers in parentheses indicate the number of genes overlapping between the GLAD4U or EBIMed lists and the corresponding gold standard; numbers in
square brackets indicate the number of publications retrieved by the query.

The precision-recall curve and the AP score factor in 
precision at all recall levels. For ranked gene lists, parti- 
cularly in web-based applications, this may not be of 
interest to users. In most scenarios, what matters may 
be the number of relevant genes on the first page or the 
first several pages. "Precision at k" is usually used to 
measure precision at a fixed low level of retrieved 
results, e.g., the top k results. To this end, we calculated
the precisions for the top 50 (k = 50) and top 100 (k = 100)
genes for all three methods, for each query (Table
2). GLAD4U Counts and GLAD4U Hypergeometric
methods maintained higher precisions for the top 50 
genes compared to EBIMed (0.74±0.15, 0.77±0.20 and 
0.54±0.18, respectively), as well as for the top 100 genes 
(0.64±0.20, 0.69±0.25 and 0.42±0.20, respectively). 
Although the AP-based comparison may be biased 
against EBIMed owing to its low overall recall, precision 
at 50 and 100 only focus on the top ranking genes and 
are not affected by the overall recall. These results sug- 
gest that GLAD4U can produce lists where relevant 
genes are ranked at the top. 
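
The two ranking metrics used above can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used in the study: the function names are ours, the gene symbols are placeholders, and average precision is assumed to be normalized by the size of the gold standard, so that gold standard genes that are never retrieved lower the score (consistent with the remark that AP may penalize a method with low overall recall).

```python
def precision_at_k(ranked_genes, gold, k):
    """Fraction of the top-k ranked genes that are in the gold standard."""
    top = ranked_genes[:k]
    if not top:
        return 0.0
    # Divide by the number of genes actually available in the top-k slice
    return sum(gene in gold for gene in top) / len(top)


def average_precision(ranked_genes, gold):
    """Sum of the precision values at each rank where a gold gene occurs,
    divided by the size of the gold standard (assumed normalization)."""
    if not gold:
        return 0.0
    hits = 0
    total = 0.0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in gold:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / len(gold)


# Hypothetical toy ranking; "G1".."G5" are placeholder symbols
ranked = ["G1", "G2", "G3", "G4"]
gold = {"G1", "G3", "G5"}
print(precision_at_k(ranked, gold, k=2))
print(round(average_precision(ranked, gold), 4))
```

In this toy case two of the three gold standard genes are retrieved, at ranks 1 and 3, so the missing third gene contributes zero to the average.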



Although precision was less than perfect even for the
top-ranking genes, we noticed that many false positives
could be explained by the incompleteness of the gold
standards. Table 3 lists the first 10 genes, along with
their first 10 supporting publications, returned by
the GLAD4U Hypergeometric method that were not in the
corresponding gold standards for the terms "apoptosis"
and "hypertension" (see additional files 1 and 2 for the
complete lists of genes and supporting publications).
Taking the first and last genes in the list as examples
for each term (i.e., MDM2 and ING1 for apoptosis, and
REN and ADIPOQ for hypertension), we found strong evidence
in the most recent supporting publications for
linking these non-gold-standard genes to the query.
MDM2 has antiapoptotic effects, and its direct interaction
with and regulation of p53 define it as an oncogene
[12-15]. It translocates to the nucleus to interact with
p53 and p300, and promotes cell growth by initiating p53
degradation [16,17]. Its expression is directly linked to
prostate cancer patient susceptibility [18]. Inhibitor of
growth family, member 1 (ING1) is involved in cell
stress and DNA damage response [19-22]. Up-regulation
of p33ING1b or p24ING1c, two of the three alternatively
spliced transcripts of ING1, resulted in increased
early apoptotic cells [23,24], probably through interactions
with MDM2, p14ARF, and lamin A [25,26]. This
effect is dependent on the presence of functional p53
[25,27] and the H3K4me3 binding domain of ING1 [28].

[Figure 1: precision/recall plots, one panel per query (panel titles recovered from the extraction: Obesity, DNA Repair, Schizophrenia); x-axis: Recall, 0.0 to 1.0.]

Figure 1 Precision/recall curves for different prioritization methods. Precision/recall curves for GLAD4U Counts, GLAD4U Hypergeometric 
and EBIMed are colored in black, red, and green, respectively. Dashed lines correspond to the precision levels of 0.8 and 0.5. 



Regarding hypertension, renin (REN) is part of the
renin-angiotensin system (RAS). Proteins in this system
are thought to be important regulators of blood pressure
and are involved in the onset of hypertension [29-32].



Table 2 Comparison of different prioritization methods

Method                 Metric                Apoptosis  Cell Adhesion  DNA Repair  Hypertension  Obesity  Schizophrenia
GLAD4U Counts          AP                    0.4939     0.4611         0.6670      0.3947        0.5698   0.4601
                       Precision at k = 50   0.8600     0.8000         0.8800      0.4800        0.7800   0.5400
                       Precision at k = 100  0.8300     0.7300         0.8100      0.3800        0.5500   0.4200
GLAD4U Hypergeometric  AP                    0.4942     0.5723         0.8139      0.4564        0.4782   0.4280
                       Precision at k = 50   0.9400     0.9000         1.0000      0.5800        0.6200   0.4800
                       Precision at k = 100  0.9000     0.8500         0.9700      0.3900        0.5200   0.4400
EBIMed                 AP                    0.1567     0.1256         0.3517      0.1336        0.2673   0.2318
                       Precision at k = 50   0.6200     0.4800         0.8400      0.3137        0.5652   0.4423
                       Precision at k = 100  0.5980     0.4848         0.6700      0.2821        0.1586   0.3200

AP: Average Precision

Table 3 First 10 genes retrieved by GLAD4U and not listed in the gold standard lists

Rank  Entrez-Gene ID (Gene symbol)  Score    PMIDs*

Apoptosis
41    4193 (MDM2)                   53.5212  21051655, 21051533, 20849854, 20849851, 20832750, 20822933, 20708156, 20659896, 20657550, 20644561
48    1432 (MAPK14)
49    4609 (MYC)                    37.2798  20714214, 20598117, 20596624, 20573831, 20564213, 20515470, 20232342, 20071475, 19996270, 19966300
54    6774 (STAT3)                  35.2695  20562100, 20514402, 20507639, 20490331, 20459702, 20447714, 20213502, 20197401, 20164027, 20154216
77    5580 (PRKCD)                  23.3017  20548952, 20547768, 20471435, 20093486, 19932628, 19917613, 19875824, 19833733, 19808702, 19747914
78    29126 (CD274)                 23.1218  20636820, 20617899, 20587542, 20506224, 20445553, 20363965, 19916867, 19826049, 19811426, 19794071
79    142 (PARP1)                   22.9308  20940411, 20665026, 20644561, 20629644, 20564216, 20453000, 20388712, 20181890, 20177052, 20072652
86    406991 (MIR21)                18.9856  20813833, 20515755, 20514462, 20447717, 20404348, 20372781, 20371612, 20346171, 20153722, 20148895
96    7295 (TXN)                    16.1886  20619274, 20430109, 20298786, 20103619, 19671194, 19566940, 19328186, 19120277, 18983687, 18848838
100   3621 (ING1)                   15.3784  19085961, 18836436, 18801192, 18691180, 18655775, 18533182, 18388957, 17585055, 17379210, 16607280

Hypertension
10    5972 (REN)                    61.9237  20925572, 20662730, 20577119, 20537141, 20429690, 20223792, 20160196, 19891555, 19673942, 19536175
12    3291 (HSD11B2)                45.7032  20597806, 19811365, 19150652, 18837962, 18573267, 18178212, 17551100, 16872738, 16778331, 16109323
14    4879 (NPPB)                   36.9570  20713912, 20368210, 20350538, 20346360, 20234137, 20142024, 20113292, 20102554, 20087954, 20083731
17    4524 (MTHFR)                  32.2080  21072525, 21060006, 20960113, 20852445, 20812180, 20717043, 20669348, 20637366, 20592457, 20479155
19    1401 (CRP)                    31.9446  21044781, 20805569, 20733302, 20683147, 20676960, 20346360, 20339115, 20184533, 20074254, 20068351
20    4878 (NPPA)
21    155 (ADRB3)                   28.4824  20831043, 20144152, 20044737, 19842096, 19779464, 19479237, 19131662, 18724972, 18510051, 18088254
24    1584 (CYP11B1)                24.8304  20708777, 20339375, 19820005, 19567537, 19082699, 18663314, 18294861, 17980006, 17296872, 17121536
27    59272 (ACE2)                  21.4649  20831027, 20813695, 20679547, 20349406, 20160196, 20117991, 19926873, 19684612, 19289653, 19286756
29    9370 (ADIPOQ)

* Only the 10 most recent supporting publications are shown here. See additional files 1 and 2 for the complete list of false-positive genes and their
corresponding supporting publications.



Overexpression of REN leads to hypertension via
chronic overproduction of Ang II [33,34], and inhibiting
the regulators of the RAS, such as REN, is a common
treatment for hypertension [32]. Adiponectin (ADIPOQ)
is an adipocytokine synthesized by the adipose tissue. It
has been proposed as a biomarker for hypertension, as
low plasma levels correlate with a higher risk of hypertension
[35-38], and possibly with coronary artery disease,
kidney disease, left ventricular hypertrophy, and
even myocardial infarction [36,39-41]. Interestingly,
REN and ADIPOQ also present polymorphisms, which
seem linked to therapeutic response to hypertension
[31,40,42-46].

From these publications, we believe that MDM2 and
ING1 should be part of the apoptosis list, and that REN
and ADIPOQ should be part of the hypertension list.
These results accentuate the incompleteness of the gold
standards and suggest that GLAD4U can help in the
completion of the gold standard lists.

Thresholding score to enhance GLAD4U performance 

To evaluate whether thresholding the gene score can
enhance GLAD4U performance, we acquired a broader
collection of disease-associated gene lists curated by Kohler et al.
[47] and available from the GeneWanderer website
(http://compbio.charite.de/genewanderer). We extracted
32 "disease-gene families" to use as standards for evaluating
GLAD4U performance before and after thresholding.
On average, GLAD4U performs 2.90-fold better when
genes with low prioritization scores (i.e., prioritization
score < 2 or hypergeometric p value > 0.01) are removed,
as illustrated by comparing the F-measures (Figure 2). The
largest improvements were achieved for terms
such as "prostate cancer", "obesity", and "amyotrophic lateral
sclerosis" (fold changes of 7.28, 5.72, and 5.48, respectively)
(see additional file 3 for the before and after F-measures
and the corresponding fold changes). The queries that
benefited least from thresholding the gene list included
"Noonan Syndrome, Costello syndrome, Cardiofaciocutaneous
Syndrome", "Nonsyndromic hearing loss", and
"Chondrodysplasia punctata" (fold changes of 1, 1.16, and 1.17,
respectively).
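
The before/after comparison described above can be sketched as follows. The scores, gene symbols and gold standard below are invented for illustration, and the helper names are ours; only the cut-off (genes with a prioritization score below 2 are removed) comes from the text.

```python
def f_measure(retrieved, gold):
    """F-measure of a retrieved gene set against a gold standard set."""
    retrieved, gold = set(retrieved), set(gold)
    overlap = len(retrieved & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(retrieved)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def apply_threshold(scores, min_score=2.0):
    """Keep genes whose prioritization score is at least min_score."""
    return {gene for gene, score in scores.items() if score >= min_score}


# Hypothetical scores and gold standard, for illustration only
scores = {"G1": 12.4, "G2": 5.1, "G3": 1.3, "G4": 0.7, "G5": 0.2}
gold = {"G1", "G2"}

before = f_measure(scores.keys(), gold)
after = f_measure(apply_threshold(scores), gold)
fold_change = after / before
print(round(before, 4), round(after, 4), round(fold_change, 2))
```

Removing low-scoring genes raises precision while leaving recall unchanged as long as the gold standard genes score above the cut-off, which is why the F-measure improves in this toy case.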

User interface 

GLAD4U uses a simple query interface for users to submit
their queries. Any query that is valid in a PubMed search
can be used in GLAD4U. In the query interface, users can
also modify the default parameters of the application,
including the search space (all species or restricted to human
genes), the number of genes to present per result page, the
maximum number of publications supporting each gene
returned in the result page, and the number of pages to
build for each of the algorithm runs.

[Figure 2: horizontal bar chart of the F-measure (x-axis, 0.00 to 0.60) for each disease-gene family, with one bar for "Hypergeometric, thresholded" and one for "Hypergeometric". Families shown: Xeroderma pigmentosum; Thyroid carcinoma, papillary; Spinocerebellar Ataxia; Retinitis pigmentosa; Prostate cancer; Obesity; Noonan Syndrome, Costello syndrome, Cardiofaciocutaneous Syndrome; Nonsyndromic hearing loss; Non-Insulin-Dependent Diabetes Mellitus; Mitochondrial complex I deficiency disorders; Microphthalmia; Long QT Syndrome; Limb-Girdle Muscle Dystrophy; Leigh Syndrome; Leber congenital amaurosis; Inflammatory Bowel Disease; Hypertrophic cardiomyopathy; Hermansky-Pudlak syndrome; Glioma of brain, familial; Fanconi anemia; Essential hypertension; Esophageal carcinoma; Epidermolysis bullosa; Ehlers Danlos syndrome; Dilated cardiomyopathy; Chondrodysplasia punctata; Charcot Marie Tooth Disease; Cataract, autosomal dominant; Breast cancer familial; Bardet-Biedl Syndrome; Amyotrophic lateral sclerosis; Age-Related Macular Degeneration.]

Figure 2 F-measure evaluations of GLAD4U before and after thresholding. F-measure evaluations of GLAD4U before and after thresholding,
for each disease-associated gene list. A higher F-measure indicates a better GLAD4U performance.

The output page displays the ranked gene list and information
associated with each gene (Figure 3). As each gene
is identified by an Entrez-Gene ID, we use eSummary,
another NCBI eUtility [48], to fetch annotations for the
gene, including name, symbol and species. Publications supporting
the relationship between a gene and the query term
are listed under the gene. The publications are ordered
by their PubMed IDs in descending order so that the most recent
publication is listed first (see Figure 3, under the "ADIPOQ"
gene description). As for genes, we use eSummary to fetch
information for each publication, such as title, author and
journal name. Genes and publications are hyperlinked to
the corresponding NCBI pages, which, by design, open
in a new window to avoid disrupting the result page.
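
As a rough illustration of the NCBI calls mentioned above, the sketch below only builds request URLs for the public E-utilities endpoints `esearch.fcgi` and `esummary.fcgi` and sorts PubMed IDs; it performs no network access. The helper names, the `retmax` default and the choice of JSON output are our assumptions, not a description of how GLAD4U is implemented.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query, retmax=100):
    """URL for an eSearch request returning PubMed IDs that match a query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def esummary_url(db, ids):
    """URL for an eSummary request returning summaries for a list of IDs."""
    params = {"db": db, "id": ",".join(str(i) for i in ids), "retmode": "json"}
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"


def most_recent_first(pmids):
    """Sort PubMed IDs in descending numeric order, a proxy for recency."""
    return sorted(pmids, key=int, reverse=True)


print(esearch_url("obesity"))
print(esummary_url("gene", [9370, 3952]))  # Entrez-Gene IDs of ADIPOQ and LEP
print(most_recent_first(["19085961", "21051655", "20849854"]))
```

Any query string that PubMed accepts can be passed as `query`; `urlencode` takes care of escaping spaces and special characters.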

At the top of the output page, a summary of the run is
also given: query term and options chosen, number of
genes and publications processed, as well as a hyperlink to
download the complete results in the comma-separated
values (CSV) format. Although this file may be difficult for
humans to interpret, it can be used as input for other
computational analysis tools. For example, we have implemented
a "send data to Functional Enrichment Analysis"
link in the result page (Figure 3) of GLAD4U for submitting
a gene list to the functional enrichment analysis tool
WebGestalt [49,50]. This function is particularly handy for
the functional interpretation of a gene list, e.g., a list
returned by a disease term query. It can help reveal
biological processes associated with the disease. As an
example, enrichment analysis on the first 100 genes
returned by the "Obesity" query linked this disease to biological
processes such as "fat cell differentiation" (20
genes, multiple-test adjusted enrichment p-value (adjp) =
5.27e-28), "lipid metabolic process" (39 genes, adjp =
5.05e-20) and "response to insulin stimulus" (17 genes,
adjp = 4.99e-18). In addition, we have also implemented a
"visualize genes in a protein-protein interaction network"
link, which allows the visualization of interactions among
the protein products of the genes based on the Cytoscape
Web utility (http://cytoscapeweb.cytoscape.org/).

[Figure 3: screenshot of a GLAD4U result page for the query "obesity". Content recovered from the screenshot:]

Summary
Generated on: February 22, 2012
Query: obesity (Parameters used: threshold of 0.01, search only human genes, 100 genes per page, 10 publications per gene, 5 page links per page)
Number of publications retrieved: 157,669
Number of publications containing gene information (among the 157,669): 4,727
Number of genes in these 4,727 publications: 1,859
Number of genes after the score threshold: 288

Genes identified in your query, from highest to lowest scores:
1. ADIPOQ - adiponectin, C1Q and collagen domain containing [Homo sapiens, Entrez-Gene ID:9370], score: 1000 (entry expanded to show its 10 most recent supporting publications)
2. LEP - leptin [Homo sapiens, Entrez-Gene ID:3952], score: 321.9260, 300 supporting publications
3. FTO - fat mass and obesity associated [Homo sapiens, Entrez-Gene ID:79068], score: 276.2372, 184 supporting publications
4. MC4R - melanocortin 4 receptor [Homo sapiens, Entrez-Gene ID:4160], score: 214.5474, 149 supporting publications
5. ADRB3 - adrenergic, beta-3-, receptor [Homo sapiens, Entrez-Gene ID:155], score: 145.6010, 122 supporting publications
6. LEPR - leptin receptor [Homo sapiens, Entrez-Gene ID:3953], score: 140.0002, 129 supporting publications
7. PPARG - peroxisome proliferator-activated receptor gamma [Homo sapiens, Entrez-Gene ID:5468], score: 131.9575, 195 supporting publications
8. GHRL - ghrelin/obestatin prepropeptide [Homo sapiens, Entrez-Gene ID:51738], score: 114.4299, 129 supporting publications
9. RETN - resistin [Homo sapiens, Entrez-Gene ID:56729], score: 104.9504, 96 supporting publications
10. UCP2 - uncoupling protein 2 (mitochondrial, proton carrier) [Homo sapiens, Entrez-Gene ID:7351], score: 89.6734, 80 supporting publications
11. UCP3 - uncoupling protein 3 (mitochondrial, proton carrier) [Homo sapiens, Entrez-Gene ID:7352]

Figure 3 GLAD4U output page. A typical result page generated by a query with GLAD4U. The summary section presents the main statistics for
the query, along with hyperlinked icons to download the results as an archive of all pages of results ("compressed" icon), a CSV
("Excel" icon) or a text ("text" icon) file. Right below the summary, a link is available to send the results for functional enrichment analysis. In the
main result section, the prioritized genes are presented. The user can click the "+" to show/hide the supporting publications, which are all
hidden by default to keep the gene information easy to read. The ADIPOQ gene is presented with its supporting publications as an example.

Discussion

Reading through all relevant literature to generate a gene 
list is time consuming [10,51-53], a common concern 
that came up in all interviews of experimentalists that we 
performed (results not shown). GLAD4U addresses this 
problem by automatically creating a ranked list of genes 
following a user's input query. 

One important feature of GLAD4U is its information 
processing. Based on our survey among experimentalists, 
GLAD4U follows the exact same steps that an experi- 
mentalist would follow: gather literature, extract gene 
information and create an expert list [54]. Whether a 
user queries a disease, a non-disease phenotype, a biological
process or a gene, GLAD4U will fetch the corresponding
biomedical publications using NCBI's eUtilities API,
retrieve the relevant genes, rank them and return
them to the user. GLAD4U ensures computational
efficiency through effective use of existing NCBI 
resources, which also made it one of the winning applica- 
tions in the National Library of Medicine (NLM)'s 2011 
Software Development Challenge on the Innovative Uses 
of NLM Information. 

Another important feature of GLAD4U is its simplicity. 
Researchers will be at ease using GLAD4U because its 
search engine is powered by PubMed's API [48,52],
and behaves similarly to Entrez-PubMed [55]. GLAD4U 
outputs a clean result page where the user can easily find 
genes relevant to the concept queried and supporting 
publications. Additionally, the use of PubMed's API 
makes GLAD4U almost maintenance-free. GLAD4U will 
update itself along with the MEDLINE library update. 
This will ensure that GLAD4U's results will always be 
up-to-date with the current literature. 

Several tools rely on PubMed to build disease candidate 
genes lists [5,8,52,56,57]. EBIMed [5] and FACTA [7] are 
concept-oriented applications for mining existing biome- 
dical literature. They attempt to automatically establish 
the publication-concept (including genes) relationship 
through in-house text mining tools whereas GLAD4U 
relies on the manually curated publication-gene mapping 
provided by NCBI. According to our results, manual 
mapping seems to have a notable impact on performance. 
Nevertheless, automated mapping would allow flexibility 
in extending the services for concepts other than genes. 

Although using the biomedical literature as a knowl- 
edge source seems intuitive [51,58,59], certain limitations 
exist: the literature is indexed based on titles, abstracts 
and keywords, not on full-text [60,61]. Thus, a set of pub- 
lications retrieved may be incomplete (i.e., some publica- 
tions relevant to the concept queried will not be retrieved 
because they do not contain the necessary keywords in 
their titles or abstracts) [62]. There is a possible bias in 
using the biomedical literature and ontology [55], as the 
most studied genes (those with the most publications) 
will have more weight [51,63] at the expense of more 
relevant genes that might only be featured in few papers 
[64]. Thus, we use the hypergeometric test to rank genes 
based on how likely it would be to retrieve them by 
chance alone, based on the number of publications 
retrieved for this gene among the total number of 
publications linked to this gene. The less likely it is (that is, 
the smaller the p value), the higher the score will be for the gene. 
Thus, even if GLAD4U is solely retrieving its data from 
the biomedical literature, it prioritizes following a statisti- 
cal analysis of the retrieved data. 

The most obvious usage of GLAD4U is to generate a 
gene list for an input concept, which has been demon- 
strated in this paper. This can be extremely useful for the 
design of targeted high-throughput experiments. If one 
needs to create a custom array or select proteins for 
targeted quantitative proteomic analysis using the 
selected reaction monitoring (SRM) assay, one can use 
GLAD4U and review the ranked list of genes that likely 
should be included in the experimental design. Besides 
generating gene lists for individual concepts, GLAD4U is 
very flexible and allows production of gene lists related 
to multiple concepts, which cannot be done by searching 
GO or OMIM databases. For example, a query of "smok- 
ing AND cancer" can generate a gene list that could 
potentially help explore gene-environment interactions 
in cancer. GLAD4U also holds the potential to assist in 
improving the functional annotation of genes. 
Although GO contains more than 17,000 terms [4,65] 
and is regularly used in the bioinformatics field as a stan- 
dard [4,66], it is not complete [51,67]. Through manual 
checking of the top genes returned by GLAD4U that 
were not part of the gold standard lists, we easily found 
evidence that these genes were indeed linked to the 
query, and probably should have been included in the 
gold standard. 

Finally, because the GLAD4U prioritization algorithm 
assigns scores to genes, removing the genes with a low 
score consistently improves the quality of the results. This 
result justifies thresholding GLAD4U results by default. 

Conclusions 

GLAD4U is a freely available web-application for creating 
expert candidate gene lists tailored to a user's query. It fol- 
lows the same steps that the experimentalist would follow: 
gather literature, extract gene information and create an 
expert list. The simple interface of GLAD4U ensures easy 
usage and interpretation. Because GLAD4U relies on 
existing biomedical literature, it has immediate credibility 
with experimentalists, who use this resource as a primary 
means for enhancing their knowledge and expertise. 
Although the gene list directly returned from a PubMed 
query is usually lengthy and noisy, the prioritization 
method implemented in GLAD4U successfully ranks truly 
relevant genes at the top of the list and facilitates efficient 
browsing of the list. 

Methods 

Publication retrieval 

GLAD4U relies on the eSearch application programming 
interface (API) developed by the NCBI for retrieving pub- 
lications from the MEDLINE database [48]. For a user 
query, eSearch returns an XML file containing the num- 
ber of publications returned by the query and all publica- 
tion identification IDs (PMIDs). The XML file is parsed 
to get the list of PMIDs associated with a user query. 
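
The retrieval step described above can be sketched as follows. This is a minimal illustration, not GLAD4U's actual implementation: the function names, the `retmax` default and the sample XML (with made-up PMIDs) are assumptions, and the sample response is parsed locally so that the sketch runs without network access.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Public eSearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, retmax=1000):
    """Build an eSearch request URL for a PubMed query.

    The retmax default is an assumption; larger result sets would need
    paging with the retstart parameter.
    """
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_esearch_xml(xml_text):
    """Return (total hit count, list of PMIDs) from an eSearch XML response."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    pmids = [id_node.text for id_node in root.findall("./IdList/Id")]
    return count, pmids


# Hand-written sample response with made-up PMIDs (illustration only).
SAMPLE_XML = """<eSearchResult>
  <Count>3</Count>
  <RetMax>3</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
    <Id>33333333</Id>
  </IdList>
</eSearchResult>"""

count, pmids = parse_esearch_xml(SAMPLE_XML)
print(count, pmids)
print(build_esearch_url("smoking AND cancer"))
```

Because the query string is passed to eSearch unchanged, any query that is valid for PubMed, including Boolean combinations of concepts, can be used.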

Gene retrieval 

Genes associated with PMIDs are retrieved based on the 
gene-to-publication link table provided by Entrez-Gene 
[68]. Links between Entrez-Gene IDs and PMIDs are cre- 
ated based on both manual curation within the NCBI 
and integration of information from other public data- 
bases. Publications linked to more than 500 genes are 
removed from the link table because they lack specificity. 
After this process, the link table included 3,509,732 genes 
and 647,523 publications for all organisms, among which 
30,343 genes and 306,487 publications were related to 
human (as of 05/14/2011). 
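
As a rough sketch of this filtering step, the code below builds a publication-to-gene table from (gene ID, PMID) pairs and drops publications linked to more than 500 genes. The input layout, the function names and the toy identifiers are assumptions made for illustration; only the 500-gene cutoff comes from the text.

```python
from collections import defaultdict

MAX_GENES_PER_PUBLICATION = 500  # cutoff stated in the text


def build_link_table(pairs, max_genes=MAX_GENES_PER_PUBLICATION):
    """Map each PMID to its set of gene IDs, dropping non-specific publications."""
    genes_by_pmid = defaultdict(set)
    for gene_id, pmid in pairs:
        genes_by_pmid[pmid].add(gene_id)
    # Keep only publications linked to at most max_genes genes.
    return {
        pmid: genes
        for pmid, genes in genes_by_pmid.items()
        if len(genes) <= max_genes
    }


def genes_for_query(pmids, link_table):
    """Count, for each gene, the query publications that are linked to it."""
    counts = defaultdict(int)
    for pmid in pmids:
        for gene_id in link_table.get(pmid, ()):
            counts[gene_id] += 1
    return dict(counts)


# Toy (gene_id, pmid) pairs; all identifiers are made up for illustration.
pairs = [(672, "111"), (675, "111"), (672, "222"), (7157, "333")]
# One publication linked to 600 genes, which exceeds the cutoff.
pairs += [(gene_id, "999") for gene_id in range(10000, 10600)]

table = build_link_table(pairs)
print(sorted(table))
print(genes_for_query(["111", "222", "999"], table))
```

The per-gene counts returned by `genes_for_query` correspond to the number of query-relevant publications involving each gene, which is the quantity that both prioritization methods start from.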

Gene prioritization 

We studied two methods to prioritize the retrieved genes 
based on publication counts or the hypergeometric test. 
To prioritize using counts ("GLAD4U Counts"), each 
gene receives a score equal to the number of publications 
describing it in the link table. The other method 
("GLAD4U Hypergeometric") uses the hypergeometric 
test to prioritize all retrieved genes. Specifically, for a 
given query Q and a gene G, let n be the number of pub- 
lications retrieved for the query and present in the gene- 
to-publication link table (query-relevant publications) 
and k be the number of query-relevant publications that 
involve the gene G. Let us further assume that there are 
m publications in the gene-to-publication link table, j of 
which involve the gene G (gene-relevant publications). 
This method calculates the probability of observing k or 
more query-relevant publications for the gene by chance, 
based on the hypergeometric test, and scores the gene 
using the following formula: 

$$Sc = -\log_{10} f(m, n, j, k), \quad \text{where} \quad f(m, n, j, k) = \sum_{i=k}^{\min(n, j)} \frac{\binom{j}{i}\binom{m-j}{n-i}}{\binom{m}{n}}$$
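
The hypergeometric score can be illustrated with a short sketch that uses exact integer arithmetic. The function names are hypothetical, the base-10 logarithm is an assumption, and the numbers in the example are toy values rather than results from the paper; m, n, j and k have the meanings defined above.

```python
import math


def hypergeometric_tail(m, n, j, k):
    """Return (favorable, total) counts for P(X >= k).

    m: publications in the link table
    n: query-relevant publications
    j: gene-relevant publications
    k: query-relevant publications involving the gene
    """
    if not (0 <= j <= m and 0 <= n <= m and 0 <= k <= min(n, j)):
        raise ValueError("inconsistent publication counts")
    favorable = sum(
        math.comb(j, i) * math.comb(m - j, n - i)
        for i in range(k, min(n, j) + 1)
    )
    return favorable, math.comb(m, n)


def hypergeometric_score(m, n, j, k):
    """Score = -log10 of the tail probability (base 10 is an assumption)."""
    favorable, total = hypergeometric_tail(m, n, j, k)
    # Subtracting logarithms of exact integers avoids float underflow
    # when the probability is extremely small.
    return math.log10(total) - math.log10(favorable)


# Toy comparison: a heavily studied gene versus a rarely studied one.
m, n = 1000, 50
score_a = hypergeometric_score(m, n, j=200, k=12)  # many papers overall
score_b = hypergeometric_score(m, n, j=8, k=5)     # few papers overall
print(score_a, score_b)
```

In this toy setting a count-based ranking would place the first gene higher (12 versus 5 publications), whereas the hypergeometric score favors the second gene, because 5 of its 8 publications fall within the query results.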



Performance evaluation 

We used GO and disease terms as queries to evaluate 
the performance of the GLAD4U algorithms. Gene 
lists curated in GO, OMIM and GAD [69] were used 
as a gold standard (i.e. relevant genes). We developed 
a Perl script to parse the files "gene2go.gz" [68] and 
"gene_ontology.1_2.obo" [70] in order to generate gene 
lists for GO terms (as of 12/20/2009). Because of the 
parent-child relationship among the GO terms as 
described in the GO Directed Acyclic Graph, genes with 
granular annotations were associated with their parent 
terms using the Perl script. Using GAD, we identified 
all genes associated with a disease term. Using OMIM, 
we retrieved all IDs prefixed with "%" and "#" with the 
query in the title. Corresponding gene IDs were 
mapped by parsing the file "mim2gene" [68] (as of 
12/22/2009). For each disease term, the lists obtained with 
GAD and OMIM were merged to serve as a gold standard. 
Retrieval performance was evaluated using precision, 
recall and F-measure. The F-measure is calculated 
as $F = 2pr/(p + r)$, where $p$ is the precision, defined as 
$|\{\text{relevant genes}\} \cap \{\text{retrieved genes}\}| / |\{\text{retrieved genes}\}|$, and $r$ is the 
recall, defined as $|\{\text{relevant genes}\} \cap \{\text{retrieved genes}\}| / |\{\text{relevant genes}\}|$. 
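
The three set-based measures can be computed directly from the gold standard and the retrieved gene list. The sketch below is illustrative only; the function name and the example gene sets are assumptions and do not reproduce any result from the paper.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-measure) for two collections of genes."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f


# Toy gene sets chosen for illustration only.
gold_standard = {"BRCA1", "BRCA2", "TP53", "ATM"}
retrieved = {"BRCA1", "BRCA2", "TP53", "EGFR", "MYC", "KRAS"}

p, r, f = precision_recall_f(gold_standard, retrieved)
print(p, r, f)
```

Here three of the six retrieved genes are relevant (precision 0.5) and three of the four relevant genes are retrieved (recall 0.75), which gives an F-measure of 0.6.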

We used the precision/recall curve, average precision 
(AP) and precision at the top k retrieved genes (k = 50 
and k = 100) to evaluate the performance of our gene 
prioritization methods, and compared it to the perfor- 
mance of the ranked lists generated by EBIMed [6]. 
All performance values are expressed in the text as 
mean ± standard deviation. 
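
For ranked lists, precision at the top k genes and average precision can be sketched as below. The text does not spell out how AP was normalized, so this sketch assumes the common definition that divides by the number of relevant genes; the function names and the toy ranking are also assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k ranked genes that are in the gold standard."""
    top = ranked[:k]
    return sum(1 for gene in top if gene in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant gene occurs.

    Normalized by the total number of relevant genes (an assumption).
    """
    hits = 0
    total = 0.0
    for rank, gene in enumerate(ranked, start=1):
        if gene in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Toy ranking: G1 and G3 are relevant, G6 is relevant but never retrieved.
ranked = ["G1", "G2", "G3", "G4", "G5"]
relevant = {"G1", "G3", "G6"}

print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

Both measures reward lists that place relevant genes near the top, which is why they complement the set-based precision, recall and F-measure when comparing prioritization methods.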

Web implementation 

The GLAD4U user interface was developed in HTML and 
PHP. The scripts to deploy and update the algorithm 
on web servers were written in Perl, while the 
hypergeometric test scores are generated in C. jQuery 
was used to implement user features such as the ability to 
hide/show options and functions. An email notification 
module was implemented to allow users to retrieve their 
results at a later time. GLAD4U (http://bioinfo.vanderbilt.edu/glad4u) 
is platform-independent and released under a GNU 
GPL license [71]. It was tested on Internet Explorer 5.0, 
Firefox 3.0, Safari 3.0, Chrome and Netscape 7, as well as 
higher versions of these browsers. 

Additional material 



Additional file 1: False-positive genes retrieved by querying 
"apoptosis" with GLAD4U. This table shows all genes retrieved by 
GLAD4U with the query "apoptosis" that were not among the gold 
standards. The table presents the rank and score of these genes and all 
the retrieved supporting publications. 

Additional file 2: False-positive genes retrieved by querying 
"hypertension" with GLAD4U. This table shows all genes retrieved by 
GLAD4U with the query "hypertension" that were not among the gold 
standards. The table presents the rank and score of these genes and all 
the retrieved supporting publications. 

Additional file 3: GLAD4U prioritization of disease candidate genes. 
This table shows the number of genes associated with each 
GeneWanderer hereditary disease, retrieved by GLAD4U and overlapping 
between the two lists before and after thresholding. F-measure fold 
change between the GLAD4U prioritized list before and after 
thresholding, as well as the actual F-measures are also displayed in the 
table. 



List of abbreviations used 

ADIPOQ: adiponectin; API: application programming interface; CSV: comma-separated 
values; GAD: genetic association database; GLAD4U Counts: 
GLAD4U prioritization algorithm using counts; GLAD4U Hypergeometric: 
GLAD4U prioritization algorithm using the hypergeometric test; GLAD4U: 
gene list automatically derived for you; GO: gene ontology; GOTM: GOTree 
Machine; ING1: inhibitor of growth family, member 1; AP: average precision; 
NCBI: national center for biotechnology information; OMIM: online 
mendelian inheritance in man; PMIDs: publication identification IDs; REN: 
renin; SRM: selected reaction monitoring. 